Abstract

Neighbourhood-based Collaborative Filtering is a widely used method in recommender systems.
However, the risk of revealing customers’ private data during the filtering process has
recently attracted noticeable public concern. Specifically, the kNN attack discloses a target
user’s sensitive information by creating k fake nearest neighbours from non-sensitive
information. Among the current defences against the kNN attack, probabilistic methods have
shown a strong privacy-preserving effect. However, the existing probabilistic methods neither
guarantee sufficient prediction accuracy, because of their global randomness, nor provide
assured security enforcement against the kNN attack. To overcome these problems, we propose a
novel approach, Partitioned Probabilistic Neighbour Selection, which ensures a required
security guarantee while achieving the optimal prediction accuracy against the kNN attack. In
this paper, we define the sum of the k neighbours’ similarities as the accuracy metric
$\alpha$, and the number of user partitions across which we select the k neighbours as the
security metric $\beta$. Unlike existing methods, which select neighbours globally, our method
selects neighbours from each partition with the exponential differential privacy mechanism,
which decreases the magnitude of noise. Theoretical and experimental analyses show that, for
the same security guarantee against the kNN attack, our approach ensures the optimal
prediction accuracy.


Keywords: Privacy Preserving, Differential Privacy, Neighbourhood-based Collaborative
Filtering, Internet Commerce

1 Introduction


Recommender systems predict customers’ potential preferences by aggregating historical data
and customers’ interests. Recommender systems have recently become increasingly important in
various Internet applications. For example, Amazon has benefited for a decade from providing
personal recommendations to its customers, and Netflix offered a one million U.S. dollar award
for improving its recommender system to make its business more profitable [8,12,23].
Currently, Collaborative Filtering (CF) is a well-known recommender-system technology with
three main popular algorithms [14], i.e., neighbourhood-based methods, association-rule-based
prediction, and matrix factorisation. Among these algorithms, neighbourhood-based methods are
the most widely used in industry because of their easy implementation and high prediction
accuracy.

One of the most popular neighbourhood-based CF methods is the k Nearest Neighbour (kNN)
method, which provides recommendations by aggregating the opinions of a user’s k nearest
neighbours. Although the kNN method delivers accurate recommendations efficiently, the risk of
disclosing customers’ private data during the filtering process is a growing concern. One
example is the kNN attack, which reveals a user’s sensitive information by exploiting the
property that users are more similar when they share the same ratings on the corresponding
non-sensitive items. Thus, designing an efficient privacy-preserving neighbourhood-based CF
algorithm against the kNN attack, one that balances system security and recommendation
accuracy, has become a natural research problem.

The literature on CF recommender systems offers several approaches to preserving customers’
privacy. Generally, cryptographic methods, obfuscation, perturbation, and randomised methods
(including naive probabilistic methods and differential privacy methods) are applied. Among
them, cryptographic methods provide the most reliable security, but their extra computational
cost cannot be ignored. Obfuscation methods [20,25] and perturbation methods introduce
designed random noise into the original matrix to preserve customers’ sensitive information;
however, the magnitude of noise is hard to calibrate in these two types of methods [7,27]. The
naive probabilistic method provides a similarity-based weighted neighbour selection for the k
neighbours. Similar to perturbation, McSherry et al. presented a naive differential privacy
method which adds calibrated noise to the covariance (similarity between users/items) matrix.
Similar to the naive probabilistic neighbour selection, Zhu et al. [27] proposed Private
Neighbour CF, which preserves privacy against the kNN attack by introducing differential
privacy when randomly selecting the k nearest neighbours and then adding Laplace noise to the
covariance matrix. Although the methods in [1, 17, 27] successfully preserve users’ privacy
against the kNN attack, their prediction accuracy is low because of the global randomness.
Moreover, none of the existing randomised privacy-preserving CF algorithms provides assured
security enforcement before the filtering process.

Contributions. In this paper, to overcome the unsatisfactory prediction accuracy and the
unassured security guarantee of the existing probabilistic approaches against the kNN attack,
we propose a novel method, Partitioned Probabilistic Neighbour Selection. The main
contributions of this paper are:

• We clearly define performance metrics for both prediction accuracy and system security in
order to analyse the performance of privacy-preserving CF methods theoretically. Specifically,
we define the sum of the k neighbours’ similarities as the accuracy metric $\alpha$, and the
number of user partitions across which we select the k neighbours as the security metric
$\beta$.

• We propose a novel differential privacy preserving method, Partitioned Probabilistic
Neighbour Selection (PPNS), which achieves the optimal prediction accuracy $\alpha$ for a
given desired system security $\beta$ among all the existing randomised neighbourhood-based
CF recommendation algorithms.

• We show that, compared with the related methods, the proposed PPNS method performs
consistently well across various experimental settings. For example, we compare the accuracy
on different datasets; we design experiments on both user-based and item-based
neighbourhood-based CF; and we examine the accuracy both with and without the kNN attack.

Organisation. The rest of this paper is organised as follows. Firstly, in Section 2 we discuss
both the advantages and the disadvantages of the existing privacy-preserving methods for CF
recommender systems. Then we introduce the relevant preliminaries in Section 3. Afterwards, we
present a classic attack against neighbourhood-based CF recommender systems in Section 4.
Next, we propose a novel differential privacy recommendation approach, Partitioned
Probabilistic Neighbour Selection, in Section 5. In Section 6, we provide an experimental
analysis of the recommendation accuracy and the security of our approach. Finally, we conclude
with a summary in the final section.


2 Related Work 

A considerable body of literature has been published on preserving customers’ private data in
recommender systems. In this section, we briefly discuss some of the research on
privacy-preserving CF recommender systems.


2.1 Traditional Privacy Preserving CF Recommendation 


A number of traditional privacy-preserving methods have been developed for CF recommender
systems, including cryptographic [9,19], obfuscation [20,25], perturbation and probabilistic
methods. Erkin et al. [9] applied homomorphic encryption and secure multi-party computation in
privacy-preserving recommender systems, which allows users to jointly compute on their data to
receive recommendations without sharing the true data with other parties. Nikolaenko et al.
[19] combined a well-known recommendation technique, matrix factorisation, with a
cryptographic method, garbled circuits, to provide recommendations without learning the real
user ratings in the database. Cryptographic methods provide the highest guarantee for both
prediction accuracy and system security because they introduce encryption rather than adding
noise to the original records. Unfortunately, their extra computational cost limits their
application in industry. Obfuscation and perturbation are two similar data processing methods.
In particular, obfuscation methods mix a number of random noise values with real user ratings
to preserve users’ sensitive information. Parameswaran et al. [20] proposed an obfuscation
framework which exchanges sets of similar items before submitting the user data to the CF
server. Weinsberg et al. introduced extra plausible ratings into a user’s profile to prevent
inference of the user’s sensitive information. Perturbation methods modify the user’s original
ratings according to a selected probability distribution before using these ratings. In
particular, Bilge et al. added uniformly distributed noise to the real ratings before using
them in the prediction process, whereas Basu et al. treated the deviation between two items as
the added noise. Both perturbation and obfuscation obtain a good trade-off between prediction
accuracy and system security because the data perturbation is small, but the magnitude of
noise or the percentage of replaced ratings is not easy to calibrate [7,27]. The naive
probabilistic method applied weighted sampling in the neighbour selection process, which
successfully preserves users’ privacy against the kNN attack by perturbing the final neighbour
set; however, it cannot guarantee sufficient prediction accuracy because of the global
randomness. Moreover, these traditional privacy-preserving CF methods are unable to measure
the privacy level against the kNN attack, which impairs the credibility of the final
recommendation result.


2.2 Differential Privacy CF Recommendation 

As a well-known privacy definition, differential privacy has been applied in research on
privacy-preserving recommender systems. For example, McSherry et al. [17] provided the first
differentially private neighbourhood-based CF recommendation algorithm. This naive
differential privacy method successfully protects neighbourhood-based CF recommender systems
against the kNN attack: because Laplace noise is added to the covariance matrix globally, the
output neighbour set is no longer the original k neighbours (the k nearest candidates).
However, the prediction accuracy of their recommendation algorithm decreases significantly
because of the global noise.

Another differentially private neighbourhood-based CF method, Private Neighbour CF (PNCF),
was proposed by Zhu et al. [27] and inspires our work. They theoretically addressed the low
prediction accuracy of naive probabilistic neighbour selection with a truncation parameter
$\lambda$. As a differential privacy method, PNCF measures the selection weight by the
following equation:

$$\omega_i = \exp\left(\frac{\epsilon}{4k \times RS}\, q_a(U(u_a), u_i)\right), \tag{1}$$


where $\epsilon$ is the differential privacy budget, $q$ is the score function, $RS$ is the
Recommendation-Aware Sensitivity of the score function $q$ for any user pair $u_i$ and $u_j$,
and $U(u_a)$ is user $u_a$’s candidate list. For a user $u_a$, the score function $q$ and its
Recommendation-Aware Sensitivity are defined as:

$$q_a(U(u_a), u_i) = sim(a, i), \tag{2}$$


$$RS = \max \left\{\, r_{i,s} \cdot r_{j,s} \;\ldots\; \left( \lVert r_i \rVert \, \lVert r_j \rVert - \ldots \right) \ldots \right\}, \tag{3}$$


where $r_{i,s}$ is user $u_i$’s rating on item $t_s$, $sim(a, i)$ is the similarity between
users $u_a$ and $u_i$, $\bar{r}_i$ is user $u_i$’s average rating over all items, and $S_{ij}$
is the set of all items co-rated by both users $u_i$ and $u_j$, i.e.,
$S_{ij} = \{s \in S \mid r_{i,s} \neq 0 \wedge r_{j,s} \neq 0\}$.
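As a minimal sketch of the selection weight, the following Python snippet reads Equation (1) as $\omega_i = \exp\left(\frac{\epsilon}{4k \times RS}\, sim(a, i)\right)$. The function names, the sample similarities and the value used for RS are illustrative assumptions, not values from PNCF or from this paper.

```python
import math

def selection_weights(similarities, epsilon, k, rs):
    """Weight exp(epsilon / (4 * k * RS) * sim) for every candidate.

    `rs` stands for the Recommendation-Aware Sensitivity; it is passed in
    as a plain number here (assumption: it has been computed elsewhere).
    """
    scale = epsilon / (4.0 * k * rs)
    return [math.exp(scale * s) for s in similarities]

def selection_probabilities(weights):
    """Normalise the weights into single-draw selection probabilities."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidate similarities, sorted in descending order.
sims = [0.9, 0.5, 0.1]
w = selection_weights(sims, epsilon=1.0, k=2, rs=0.25)
p = selection_probabilities(w)
```

A larger privacy budget $\epsilon$ or a smaller sensitivity steepens the weights, so the most similar candidates dominate the draw.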

Then, the PNCF method selects the k neighbours from the candidates whose similarity is greater
than $(sim_k + \lambda)$ together with randomised candidates whose similarity lies between
$(sim_k - \lambda)$ and $(sim_k + \lambda)$, where $sim_k$ denotes the similarity of the kth
candidate of a target user and $\lambda$ is the truncation parameter. Zhu et al. provided an
equation to calculate the value of $\lambda$, i.e., $\lambda = \min(sim_k, \ldots)$, where p is
a constant, 0 < p < 1. Once the k-neighbour set is obtained, Zhu et al. added Laplace noise to
the similarity matrix of the final k neighbours to perturb the final prediction. Their
experimental results showed better prediction performance than [1].

We observe that PNCF has two weaknesses. Firstly, it adds random noise twice during filtering
(once at the neighbour selection stage and once at the rating prediction stage); the extra
randomness decreases the prediction accuracy significantly. Secondly, the value of the
truncation parameter $\lambda$ may not be achievable: the computation of $\lambda$ yields good
theoretical recommendation accuracy, but it does not yield good experimental prediction
accuracy against the kNN attack on the test datasets used in [27]. In that case, the PNCF
method [27] effectively reduces to naive Probabilistic Neighbour Selection and cannot
guarantee sufficient recommendation accuracy.

In conclusion, compared with cryptographic, obfuscation and perturbation methods, the
probabilistic methods are more efficient. The existing probabilistic solutions [1, 17, 27] for
privacy-preserving neighbourhood-based CF recommender systems apply different randomised
strategies to improve the prediction accuracy, while ensuring security against the kNN attack
by selecting the k neighbours across part or all of a target user’s candidate list. However,
they fail to guarantee sufficient prediction accuracy because of the global noise.
Additionally, none of the existing randomised privacy-preserving CF algorithms provides
assured security enforcement against the kNN attack before the CF recommendation process.
Therefore, in this paper, we aim to propose a randomised privacy-preserving
neighbourhood-based CF recommendation algorithm which first guarantees an assured level of
security and then achieves the optimal prediction accuracy under that guarantee.


3 Preliminaries 

In this section, we introduce the foundational concepts used in this paper: collaborative
filtering, differential privacy, and Wallenius’ non-central hypergeometric distribution.


3.1 k Nearest Neighbour Collaborative Filtering 


k Nearest Neighbour collaborative filtering is the most popular recommendation method in
neighbourhood-based CF recommender systems. It predicts a customer’s potential preferences by
aggregating the opinions of the k most similar neighbours.

Neighbour Selection and Rating Prediction are the two main steps in neighbourhood-based CF
[27]. At the Neighbour Selection stage, the k nearest candidates are selected from the target
user $u_a$’s candidate list $S_a$, where the similarities between $u_a$ and all other users
are calculated by a similarity metric. There are two well-known similarity metrics: the
Pearson correlation coefficient and Cosine-based similarity. In this paper, we use
Cosine-based similarity [21] because of its lower complexity:


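$$sim(i, j) = \frac{\sum_{s \in S_{ij}} r_{i,s}\, r_{j,s}}{\sqrt{\sum_{s \in S_i} r_{i,s}^2}\, \sqrt{\sum_{s \in S_j} r_{j,s}^2}}, \tag{4}$$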


where $sim(i, j)$ is the similarity between users $u_i$ and $u_j$; $r_{i,s}$ is user $u_i$’s
rating on item $t_s$, with $r_{i,s} \in \mathcal{R}$, where $\mathcal{R}$ is the user-item
rating dataset; $\bar{r}_i$ is user $u_i$’s average rating over all items; $S_{ij}$ is the set
of all items co-rated by both users $u_i$ and $u_j$, i.e.,
$S_{ij} = \{s \in S \mid r_{i,s} \neq 0 \wedge r_{j,s} \neq 0\}$; and $S_i$ is the set of all
items rated by user $u_i$, i.e., $S_i = \{s \in S \mid r_{i,s} \neq 0\}$.
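The Cosine-based similarity can be sketched in Python as follows. This is a minimal illustration that assumes each user’s ratings are stored as a dictionary from item to rating, with unrated items either absent or stored as 0; the function name and the sample profiles are hypothetical.

```python
import math

def cosine_similarity(ratings_i, ratings_j):
    """Cosine-based similarity between two rating dictionaries.

    The numerator runs over co-rated items only, while each norm runs
    over all items rated by the respective user.
    """
    rated_i = {s: r for s, r in ratings_i.items() if r != 0}
    rated_j = {s: r for s, r in ratings_j.items() if r != 0}
    co_rated = rated_i.keys() & rated_j.keys()
    numerator = sum(rated_i[s] * rated_j[s] for s in co_rated)
    norm_i = math.sqrt(sum(r * r for r in rated_i.values()))
    norm_j = math.sqrt(sum(r * r for r in rated_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a user without ratings is similar to nobody
    return numerator / (norm_i * norm_j)

# Hypothetical rating profiles.
u1 = {"film_a": 5, "film_b": 3}
u2 = {"film_a": 5, "film_b": 3}
u3 = {"film_c": 4}
same = cosine_similarity(u1, u2)
disjoint = cosine_similarity(u1, u3)
```

Users with no co-rated items obtain similarity zero, which is why a large part of a candidate list can be useless for prediction.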

At the Rating Prediction stage, to predict the potential rating $\hat{r}_{ax}$ of user $u_a$
on item $t_x$, all ratings on $t_x$ of the k selected users (which are called neighbours) are
aggregated. For example, for user-based methods, the prediction $\hat{r}_{ax}$ is:

$$\hat{r}_{ax} = \frac{\sum_{u_i \in N_k(u_a)} sim(a, i)\, r_{i,x}}{\sum_{u_i \in N_k(u_a)} |sim(a, i)|}, \tag{5}$$

where $N_k(u_a)$ is a sorted set which contains user $u_a$’s k nearest neighbours, sorted by
similarity in descending order, and $sim(a, i)$ is the similarity between $u_a$ and its ith
neighbour in $N_k(u_a)$.
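The Rating Prediction step can be sketched as a similarity-weighted average. The snippet below makes two labelled assumptions: the weighted sum is normalised by the sum of absolute similarities, and only neighbours who have actually rated the item contribute. The function name and data layout are hypothetical.

```python
def predict_rating(neighbours, item):
    """Similarity-weighted average of the neighbours' ratings on `item`.

    `neighbours` is a list of (similarity, ratings_dict) pairs.
    Assumption: neighbours who have not rated the item are skipped and
    the sum is normalised by the absolute similarities of the rest.
    """
    numerator = 0.0
    denominator = 0.0
    for similarity, ratings in neighbours:
        rating = ratings.get(item, 0)
        if rating != 0:
            numerator += similarity * rating
            denominator += abs(similarity)
    if denominator == 0:
        return None  # no neighbour has rated the item
    return numerator / denominator

# Hypothetical neighbour set with k = 3.
neighbours = [(1.0, {"x": 4}), (0.5, {"x": 2}), (0.8, {})]
prediction = predict_rating(neighbours, "x")
```

If a single neighbour is the only one who rated the item, the prediction equals that neighbour’s rating, which is the property a neighbour-based attack exploits.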
3.2 Differential Privacy


Informally, differential privacy is a scheme that minimises the sensitivity of the output of a
given statistical operation to a change between two datasets that differ in the one record to
be protected. Specifically, differential privacy guarantees that, whether or not one specific
record appears in a database, the privacy mechanism shields that record from the adversary.
The strategy of differential privacy is to add calibrated random noise to the result of a
query function on the database. We say that two datasets X and X' are neighbouring datasets if
they differ in at most one record. A formal definition of differential privacy is as follows:

Definition 1 ($\epsilon$-Differential Privacy). A randomised mechanism $T$ is
$\epsilon$-differentially private if, for all neighbouring datasets $X$ and $X'$ and for all
outcome sets $S \subseteq Range(T)$, $T$ satisfies
$\Pr[T(X) \in S] \leq \exp(\epsilon) \cdot \Pr[T(X') \in S]$, where $\epsilon$ is the privacy
budget.

Definition 2 (Exponential Differential Privacy Mechanism [18]). Let $q(X, x)$ be a score
function on a database $X$ that reflects the score of a query response $x$. The exponential
mechanism $T$ provides $\epsilon$-differential privacy if $T(X)$ returns a query response $x$
with probability proportional to $\exp\left(\frac{\epsilon \cdot q(X, x)}{2 \Delta q}\right)$,
where $\Delta q = \max |q(X, x) - q(X', x)|$ denotes the sensitivity of $q$.
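A single draw from the exponential mechanism can be sketched as follows. The snippet samples one candidate with probability proportional to $\exp(\epsilon \cdot q / (2 \Delta q))$; the function name, the candidate labels and the scores are illustrative assumptions.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity, rng):
    """Draw one candidate with probability proportional to
    exp(epsilon * score(candidate) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * score(c) / (2.0 * sensitivity))
               for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical scores for three query responses.
scores = {"a": 0.0, "b": 1.0, "c": 0.2}
rng = random.Random(0)
picked = exponential_mechanism(list(scores), scores.get,
                               epsilon=1.0, sensitivity=1.0, rng=rng)
```

With a small $\epsilon$ the draw is close to uniform; with a very large $\epsilon$ it almost always returns the highest-scoring response, which gives good utility but little privacy.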


3.3 Wallenius’ Non-central Hypergeometric Distribution 

Briefly, Wallenius’ Non-central Hypergeometric Distribution is a distribution of weighted
sampling without replacement. We assume that there are $c$ categories in the population and
that category $i$ contains $m_i$ individuals. All the individuals in category $i$ have the
same weight $\omega_i$. The probability that an individual is sampled at a given draw is
proportional to its weight $\omega_i$.
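The sampling scheme behind this distribution can be sketched in a few lines of Python: individuals are drawn one at a time, each draw picks a remaining individual with probability proportional to its weight, and a drawn individual is removed from the pool. The function name and the sample weights are hypothetical.

```python
import random

def weighted_sample_without_replacement(weights, n, rng):
    """Draw n distinct indices; at every draw each remaining index is
    picked with probability proportional to its weight."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(n):
        pool_weights = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=pool_weights, k=1)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return chosen

# Four individuals; the first one is five times as heavy as the others.
rng = random.Random(0)
sample = weighted_sample_without_replacement([5.0, 1.0, 1.0, 1.0], 3, rng)
```

The sketch assumes positive weights and n no larger than the population size.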

In this paper, we use the following properties of Wallenius’ Non-central Hypergeometric
Distribution to find the neighbour selection with the optimal prediction accuracy for a given
security guarantee against the kNN attack. Fog gave an approximate solution to the mean
$\mu = (\mu_1, \mu_2, \ldots, \mu_c)$ of $x = (x_1, x_2, \ldots, x_c)$, where $x_i$ denotes
the number of individuals sampled from category $i$ by Wallenius’ Non-central Hypergeometric
Distribution and $\sum_{i=1}^{c} x_i = k$:

$$\left(1 - \frac{\mu_1^*}{m_1}\right)^{1/\omega_1} = \left(1 - \frac{\mu_2^*}{m_2}\right)^{1/\omega_2} = \cdots = \left(1 - \frac{\mu_c^*}{m_c}\right)^{1/\omega_c}, \tag{6}$$

where $\sum_{i=1}^{c} \mu_i^* = k$ and $\forall i \in C: 0 \leq \mu_i^* \leq m_i$.

The solution $\mu^* = (\mu_1^*, \mu_2^*, \ldots, \mu_c^*)$ is an approximation to the mean
$\mu$. Fog stated the following properties of Equation (6): firstly, the solution $\mu^*$ is
valid under the conditions that $\forall i \in C: m_i > 0$ and $\omega_i > 0$. Secondly, the
mean given by Equation (6) is a good approximation in most cases. Thirdly, Equation (6) is
exact when all $\omega_i$ are equal.
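The approximate mean can be computed numerically. The sketch below assumes that Fog’s approximation takes the form $(1 - \mu_i^*/m_i)^{1/\omega_i} = \theta$ for a common value $\theta \in [0, 1]$, so that $\mu_i^* = m_i (1 - \theta^{\omega_i})$, and finds $\theta$ by bisection such that the means sum to the number of draws. The function name and the iteration count are illustrative choices.

```python
def wallenius_mean(m, omega, n, iterations=200):
    """Approximate mean of Wallenius' distribution by bisection.

    m[i] is the size of category i, omega[i] its weight, n the number of
    draws. Assumes mu_i = m_i * (1 - theta ** omega_i) with a common
    theta chosen so that the mu_i sum to n.
    """
    low, high = 0.0, 1.0
    for _ in range(iterations):
        theta = (low + high) / 2.0
        total = sum(mi * (1.0 - theta ** wi) for mi, wi in zip(m, omega))
        if total > n:
            low = theta   # too many expected draws: theta must grow
        else:
            high = theta
    theta = (low + high) / 2.0
    return [mi * (1.0 - theta ** wi) for mi, wi in zip(m, omega)]

# Equal weights: the result matches the proportional shares m_i * n / N.
equal = wallenius_mean([2, 3, 5], [1.0, 1.0, 1.0], n=4)
# Unequal weights: the heavier category receives the larger share.
skewed = wallenius_mean([1, 1], [2.0, 1.0], n=1)
```

When every category contains a single individual, $\mu_i^*$ can be read as the probability that this individual ends up in the sample.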
4 A Privacy Attack on CF Recommender Systems


In this section, we introduce a classic attack on neighbourhood-based CF, the k Nearest
Neighbour (kNN) attack. Calandrino et al. presented this user-based attack against the kNN CF
recommendation algorithm. Simply put, the kNN attack reveals a user’s private data by
exploiting the property that users are more similar when they share the same ratings on the
corresponding items.

We suppose that the attacker’s background knowledge consists of both the recommendation
algorithm (kNN CF recommendation) and its parameter k. Furthermore, part of a target user
$u_a$’s non-sensitive rating history, i.e., the ratings on m items that $u_a$ has rated, is
known to the attacker.

The aim of the kNN attack is to disclose $u_a$’s sensitive transactions that the attacker does
not yet know about. To achieve this goal, the attacker first registers k fake users in a kNN
recommender system, each of whom rates only $u_a$’s m non-sensitive items, with the same
ratings as $u_a$. With high probability, each fake user’s k-nearest-neighbour set
$N_k$(fake user) will include the other k - 1 fake users and the target user $u_a$. Because
the target user $u_a$ is the only neighbour who has rated items that the fake users have not
rated, the recommender system has to pass $u_a$’s ratings directly to the fake users in order
to provide recommendations on these items. Hence, the fake users learn the target user
$u_a$’s whole rating list through the kNN attack.
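The attack can be reproduced on a toy dataset. The following Python sketch is an illustration under stated assumptions: similarities are Cosine-based, predictions are similarity-weighted averages over the neighbours who rated the item, and all user names, items and ratings are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine-based similarity between two rating dictionaries."""
    common = a.keys() & b.keys()
    num = sum(a[s] * b[s] for s in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def knn(user_id, users, k):
    """The k most similar other users as (similarity, user id) pairs."""
    scored = [(cosine(users[user_id], ratings), uid)
              for uid, ratings in users.items() if uid != user_id]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def predict(neighbours, users, item):
    """Weighted average over the neighbours who rated `item`."""
    rated = [(s, users[uid][item]) for s, uid in neighbours
             if item in users[uid]]
    den = sum(abs(s) for s, _ in rated)
    return sum(s * r for s, r in rated) / den if den else None

k = 3
users = {
    "target": {"m1": 5, "m2": 3, "m3": 4, "secret": 1},
    "alice": {"m1": 2, "m4": 5},
    "bob": {"m2": 1, "m3": 2, "m5": 4},
}
# Non-sensitive ratings of the target that the attacker already knows.
known = {"m1": 5, "m2": 3, "m3": 4}
for n in range(k):
    users[f"fake{n}"] = dict(known)  # k fake users copy those ratings

neighbours = knn("fake0", users, k)
leaked = predict(neighbours, users, "secret")
```

In this toy run the neighbour set of `fake0` consists of the two other fake users and the target, so the prediction for the sensitive item is exactly the target’s own rating.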


5 Privacy Preservation by Partitioned Probabilistic Neighbour Selection

In this section, we first present the motivation and the goal of this paper. Then we provide
two performance metrics for privacy-preserving neighbourhood-based CF recommender systems
against the kNN attack. Finally, we propose our Partitioned Probabilistic Neighbour Selection
algorithm based on this motivation and goal.


5.1 Motivation 


Current research on privacy-preserving neighbourhood-based CF recommender systems applies
different randomised strategies to improve the prediction accuracy, while ensuring security
against the kNN attack by selecting the k neighbours across part or all of a target user’s
candidate list. Among these randomised strategies, differential privacy is the better
privacy-preserving mechanism because it provides a calibrated magnitude of noise.

Since the information collected by recommender systems is always the customers’ personal
data, preserving the users’ sensitive information should be the core issue of recommender
systems. However, none of the existing privacy-preserving neighbourhood-based CF
recommendation algorithms ensures a security-assured privacy preservation against the kNN
attack before the CF recommendation process. So in this paper, we present a security metric to
measure the level of system security.

In addition, the prediction accuracy should be considered carefully once security is assured;
otherwise, the recommender system would be useless to the non-malicious users, who are the
majority of customers. However, because of the global noise they introduce, the current
randomised methods cannot guarantee the prediction accuracy either. To provide sufficient
prediction utility, we have to decrease the noise as much as possible. Since there is no need
to add noise at both the neighbour selection stage and the rating prediction stage, we could
simply add Laplace noise to the final predicted rating after a regular kNN CF. Unfortunately,
Sarathy et al. reported a security risk in the Laplace mechanism for numeric data, so this
idea should be rejected. We therefore focus on adding noise at the neighbour selection stage.
Instead of selecting neighbours globally, we partition the ordered candidate list, so that we
can control the magnitude of noise inside each partition.

Therefore, in this paper, we aim to propose a partitioned probabilistic (differential privacy)
neighbour selection method which guarantees an assured level of security and then achieves the
maximum prediction accuracy under that security guarantee against the kNN attack, without any
perturbation in the rating prediction process.

5.2 Performance Metrics 
5.2.1 Accuracy 

Naturally, in any neighbourhood-based CF recommender system, aggregating the ratings of more
similar users yields a more reliable prediction. Therefore, we define the accuracy performance
metric $\alpha$ as the similarity sum of the k neighbours of a target user $u_a$. The greatest
value of $\alpha$ is thus the similarity sum of the k nearest candidates of the target user.

It is simple to compute $\alpha$ for deterministic neighbourhood-based CF algorithms, e.g.,
the kNN CF recommendation algorithm, because the k neighbours they select are fixed. So for
deterministic algorithms, we compute $\alpha$ by the following equation:

$$\alpha = \sum_{i=1}^{k} sim(a, neighbour_i). \tag{7}$$

In randomised neighbourhood-based CF algorithms, however, because of the randomisation we
should calculate $\alpha$ as the expected similarity sum of the k neighbours:

$$\alpha = E\left(\sum_{i=1}^{k} sim(a, neighbour_i)\right). \tag{8}$$

However, it is difficult to compute Equation (8) directly, as we need to find all the possible
k-neighbour combinations and their corresponding probabilities. So we give another way to
compute the expectation in Equation (8), shown below:

$$\begin{aligned} \alpha &= E\left(\sum_{i=1}^{k} sim(a, neighbour_i)\right) \\ &= \sum_{i=1}^{n} sim(a, user_i)\, E(x_i) \\ &= \sum_{i=1}^{n} sim(a, user_i)\, \mu_i, \end{aligned} \tag{9}$$

where $n$ is the number of candidates, $\sum_{i=1}^{n} \mu_i = k$ and $\mu_i \in [0, 1]$.
Section 3.3 introduced the definitions of $x_i$ and $\mu_i$.

In a deterministic algorithm, $\mu_i = 1$ when $user_i$ is selected as a neighbour of the
target user $u_a$, and $\mu_i = 0$ when $user_i$ is not a neighbour of $u_a$. Namely, in this
paper, deterministic algorithms (Equation (7)) are a special case of randomised algorithms
(Equation (9)). Therefore, we compute the accuracy metric $\alpha$ by the following equation
for both deterministic and randomised neighbourhood-based CF recommendation algorithms,
summing over all $n$ candidates:

$$\alpha = \sum_{i=1}^{n} sim(a, user_i)\, \mu_i, \qquad k = \sum_{i=1}^{n} \mu_i. \tag{10}$$
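The accuracy metric can be evaluated with a one-line function. The sketch below computes $\alpha$ as the sum of each candidate’s similarity multiplied by its expected inclusion $\mu_i$; the similarities and the two selection schemes are invented for illustration.

```python
def accuracy_metric(similarities, inclusion_means):
    """alpha = sum over candidates of sim(a, user_i) * mu_i."""
    return sum(s * mu for s, mu in zip(similarities, inclusion_means))

# Hypothetical candidate list of a target user, k = 2.
sims = [0.9, 0.8, 0.6, 0.3]
# Deterministic kNN: the two nearest candidates are always selected.
alpha_knn = accuracy_metric(sims, [1, 1, 0, 0])
# Uniformly random selection of 2 out of 4: every mu_i equals 0.5.
alpha_uniform = accuracy_metric(sims, [0.5, 0.5, 0.5, 0.5])
```

In both cases the inclusion means sum to k, and the deterministic selection attains the greatest value of $\alpha$, as expected.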


5.2.2 Security 

Given the nature of the kNN attack, the purpose of a privacy-preserving neighbourhood-based CF
recommendation algorithm is to prevent the target user from being the only real user in the
final k-neighbour set. Thus, the existing probabilistic privacy-preserving solutions select
the k neighbours across part or all of the candidate list. The number of candidates who may be
selected into the k-neighbour set (we call these candidates potential neighbours) determines
the success probability of the kNN attack: the more potential neighbours there are, the lower
the probability that the target user is the only real user in the final k-neighbour set. From
the attacker’s side, enough fake users must be created to cover the potential neighbour set so
that the target user can be the only real user. That is to say, more potential neighbours
yield a higher attack cost. In this paper, because we partition the candidate list by the
given k, we define the number of user partitions across which we select the k neighbours as
the security metric $\beta$.
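The relation between the security metric and the potential neighbour set can be sketched as follows. The snippet partitions a sorted candidate list into groups of size k and returns the candidates in the top $\beta$ partitions; the function names and the candidate labels are hypothetical.

```python
def partition_candidates(sorted_candidates, k):
    """Split a similarity-sorted candidate list into groups of size k."""
    return [sorted_candidates[i:i + k]
            for i in range(0, len(sorted_candidates), k)]

def potential_neighbours(sorted_candidates, k, beta):
    """Candidates that may enter the neighbour set when the k neighbours
    are selected across the top beta partitions."""
    partitions = partition_candidates(sorted_candidates, k)
    return [c for part in partitions[:beta] for c in part]

# Ten hypothetical candidates, already sorted by similarity.
candidates = [f"user{i}" for i in range(10)]
parts = partition_candidates(candidates, k=3)
pool = potential_neighbours(candidates, k=3, beta=2)
```

With k = 3 and $\beta$ = 2 the potential neighbour set contains $\beta k$ = 6 candidates, so an attacker has to cover six positions instead of three.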

5.3 Partitioned Probabilistic Neighbour Selection Scheme 

To achieve our goal, we first provide the objective function and its constraints based on the
discussion of the two performance metrics. Then we propose the security-assured,
accuracy-maximised privacy-preserving recommendation method by solving the objective function
under its constraints.

According to the security metric $\beta$ and the properties of the kNN attack, we partition
the entire candidate list of a user by the given k, i.e., the size of each partition (group)
is k. Before providing the objective function, we introduce some variables. We use
$f_\beta(i)$ to denote the number of neighbours selected (by weighted sampling with
exponential differential privacy) from partition No. $i$ under the given security metric
$\beta$, $i \in [1, \beta]$. Additionally, $\alpha_i$ denotes the prediction accuracy of
partition No. $i$ against the kNN attack. Therefore, we have a general equation for $\alpha$:

$$\alpha = \sum_{i=1}^{\beta} \alpha_i. \tag{11}$$

To solve Equation (11) for the optimal $\alpha$ with the given security metric $\beta$ against
the kNN attack, we select one random fake user as the user who receives the system
recommendation. We suppose that the candidate list of this fixed fake user is already sorted
in descending order of similarity. Figure 1 shows the fixed fake user’s candidate list, where
$N_i$ denotes the user set in partition $i$, $i \in [2, \beta]$, and $u_a$ is the attacker’s
target user.


Partition Number     1                    2        ...    $\beta - 1$        $\beta$
Partition Content    Fake users + $u_a$   $N_2$    ...    $N_{\beta - 1}$    $N_\beta$

Figure 1: Candidate list against kNN attack


According to Equation (10) and Figure 1, we have

$$\alpha_i = \sum_{j=1}^{k} sim_{i,j}\, \mu_{i,j}, \tag{12}$$

where $sim_{i,j}$ denotes the similarity between the jth candidate in partition No. $i$ and
the fixed fake user, $\mu_{i,j}$ denotes the corresponding mean $\mu$, and
$i \in [1, \beta]$. Moreover, because we aim to select $f_\beta(i)$ neighbours from partition
No. $i$, $\sum_{j=1}^{k} \mu_{i,j} = f_\beta(i)$.

Combining Equation (11) and Equation (12), we have

$$\alpha = \sum_{i=1}^{\beta} \sum_{j=1}^{k} sim_{i,j}\, \mu_{i,j}. \tag{13}$$


Since the similarity between each candidate in partition No. 1 and the fixed fake user is
one, we rewrite the above equation as

$$\alpha = f_\beta(1) + \sum_{i=2}^{\beta} \sum_{j=1}^{k} sim_{i,j}\, \mu_{i,j}. \tag{14}$$


Equation (14) is our objective function against the kNN attack.

Now we give the constraints of Equation (14). Since we need to select the k neighbours across
the top $\beta$ partitions, we should select at least one neighbour from partition No.
$\beta$, i.e., $f_\beta(\beta) = \sum_{j=1}^{k} \mu_{\beta,j} \geq 1$. As the candidate list
is in descending order of similarity and we select one neighbour from partition No. $\beta$,
the attacker needs to create at least $\beta k$ fake users to cover all the top partitions, no
matter how many neighbours are selected from partition No. $i$, $i \in [1, \beta - 1]$. So we
can select zero neighbours from partition No. $i$, $i \in [1, \beta - 1]$. In addition,
because $f_\beta(\beta) \geq 1$ and $\sum_{i=1}^{\beta} f_\beta(i) = k$, we have
$f_\beta(i) \leq k - 1$ for $i \in [1, \beta - 1]$. Recalling the other constraints presented
previously, we have the final objective function with constraints as follows:


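$$\begin{aligned} \text{maximise} \quad & \alpha = f_\beta(1) + \sum_{i=2}^{\beta} \sum_{j=1}^{k} sim_{i,j}\, \mu_{i,j} \\ \text{subject to} \quad & \sum_{j=1}^{k} \mu_{i,j} = f_\beta(i), \\ & \sum_{i=1}^{\beta} f_\beta(i) = k, \\ & f_\beta(i) \in \begin{cases} [1, k], & i = \beta \\ [0, k - 1], & i \in [1, \beta) \end{cases} \end{aligned} \tag{15}$$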


Then, we solve Linear Programming (15) as a Knapsack Problem, using the properties of the
equations above. The solution, i.e., the partitioned probabilistic neighbour selection method
which guarantees the optimal expectation of prediction accuracy $\alpha$ with a given security
metric $\beta$ against the kNN attack, is:

$$f_\beta(i) = \begin{cases} k - 1, & i = 1 \\ 1, & i = \beta \\ 0, & i \in (1, \beta) \end{cases} \tag{16}$$

Note that, because the candidate list of any user is in descending order of similarity,
Equation (16) will always be the optimal solution to Linear Programming (15) for any
$\beta > 1$.
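The allocation can be written as a small helper. The sketch below returns the number of neighbours taken from each of the top $\beta$ partitions: k - 1 from the first, one from the last and none in between. The handling of $\beta$ = 1, where all k neighbours come from the single partition, is our reading of the constraints rather than a case stated explicitly; the function name is hypothetical.

```python
def neighbours_per_partition(k, beta):
    """Number of neighbours to select from partitions 1..beta."""
    if k < 1 or beta < 1:
        raise ValueError("k and beta must be positive")
    if beta == 1:
        return [k]  # assumption: one partition supplies all k neighbours
    allocation = [0] * beta
    allocation[0] = k - 1   # partition No. 1
    allocation[-1] = 1      # partition No. beta
    return allocation

example = neighbours_per_partition(k=5, beta=3)
```

The allocation always sums to k, and the last partition always contributes at least one neighbour, which is what forces an attacker to cover all $\beta$ partitions.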


Algorithm 1 demonstrates the Partitioned Probabilistic Neighbour Selection (PPNS) method. From
line 1 to line 5, we compute the necessary parameters by Equations (1)-(4). In line 6, we
select the k neighbours from the partitions with exponential differential privacy by
Partitioned Probabilistic Neighbour Selection (Equation (16)). Next, once we have the k
neighbours of the target user $u_a$, we compute the predicted rating $\hat{r}_{ax}$ of $u_a$
on an item $t_x$ by Equation (5) in line 7. Finally, we return the neighbour set $N_k(u_a)$
and the predicted rating $\hat{r}_{ax}$.

6 Experimental Evaluation 

In this section, we use real-world datasets to evaluate both the accuracy and the security of
our Partitioned Probabilistic Neighbour Selection method. We begin with a description of the
datasets, then introduce the evaluation metric, and finally perform a comparative analysis of
our method and some existing privacy-preserving neighbourhood-based CF recommendation
algorithms.

6.1 Data set and Evaluation Metric 

In the experiments, we use two real-world datasets: the MovieLens dataset
(http://www.grouplens.org/datasets/movielens/) and the Douban film dataset
(https://www.cse.cuhk.edu.hk/irwin.king/pub/data/douban). Douban (http://www.douban.com) is
one of the largest rating websites in China. The MovieLens dataset consists of 100,000
ratings (1-5 integral stars) from 943 users on 1682 films, where each user has rated more
than 20 films and each film has received between 20 and 250 ratings. The Douban film dataset
contains 16,830,839 ratings (1-5 integral stars) from 129,490 unique users on 58,541 unique
films [18].

Algorithm 1 Partitioned Probabilistic Neighbour Selection.

Input:

Original user-item rating set, R;

Target user, u_a, and prediction item, t_x;

Number of neighbours, k;

Differential privacy parameter, ε;

Security metric, β.

Output:

Target user u_a's k-neighbour set, N_k(u_a);

Prediction rating of u_a on t_x, r̂_ax.

1: Compute the similarity array for target user u_a, S_a;

2: Sort S_a in descending order;

3: Compute exponential differential privacy sensitivity, RS;

4: Compute each user u_i's selection weight, ω_i;

5: Partition the sorted S_a by k;

6: Select k neighbours from top β partitions;

7: Compute r̂_ax by N_k(u_a);

8: return N_k(u_a), r̂_ax;
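
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sensitivity RS is passed in as a plain parameter, the per-partition quota is an even split that merely stands in for the allocation prescribed by Equation (16), and the prediction is a common similarity-weighted average rather than the paper's own prediction equation.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two rating vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exponential_sample(candidates, count, epsilon, sensitivity, rng):
    """Draw `count` distinct (user, sim) pairs, each draw weighted by
    exp(epsilon * sim / (2 * sensitivity)), as in the exponential mechanism."""
    pool = list(candidates)
    chosen = []
    for _ in range(min(count, len(pool))):
        weights = [math.exp(epsilon * sim / (2.0 * sensitivity)) for _, sim in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
    return chosen


def ppns_select(ratings, target, k, epsilon, beta, sensitivity=1.0, rng=None):
    """Select up to k neighbours of `target` across the top `beta` partitions."""
    rng = rng or random.Random()
    # Lines 1-2: similarity array for the target user, sorted in descending order.
    sims = [(u, cosine_similarity(ratings[target], row))
            for u, row in enumerate(ratings) if u != target]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    # Line 5: partition the sorted list into groups of size k; keep the top beta.
    partitions = [sims[i:i + k] for i in range(0, len(sims), k)][:beta]
    if not partitions:
        return []
    # Line 6: ASSUMPTION - an even quota per partition stands in for Equation (16).
    base, extra = divmod(k, len(partitions))
    neighbours = []
    for index, partition in enumerate(partitions):
        quota = base + (1 if index < extra else 0)
        neighbours.extend(
            exponential_sample(partition, quota, epsilon, sensitivity, rng))
    return neighbours


def predict(ratings, neighbours, item):
    """ASSUMPTION: similarity-weighted average over neighbours who rated `item`."""
    rated = [(u, sim) for u, sim in neighbours if ratings[u][item] != 0]
    total_sim = sum(sim for _, sim in rated)
    if total_sim == 0:
        return 0.0
    return sum(sim * ratings[u][item] for u, sim in rated) / total_sim
```

With beta = 1 the sketch returns exactly the k most similar users, which matches the observation in the text that PPNS coincides with kNN when β = 1. If the last selected partition holds fewer candidates than its quota, fewer than k neighbours are returned.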

We use a widely used metric, Mean Absolute Error (MAE) [26, 27], to measure the
recommendation accuracy in the experiments:

$$\mathrm{MAE} = \frac{1}{UI} \sum_{i=1}^{U} \sum_{j=1}^{I} \left| r_{ij} - \hat{r}_{ij} \right|, \qquad (17)$$

where r_ij is the real rating of user u_i on item t_j, r̂_ij is the corresponding predicted
rating from the recommendation algorithm, and U and I denote the number of users and items in
the experiments. Specifically, in user-based experiments, we compute the MAE of ratings from
200 random users (U = 200) on all the items (I = 1682 or I = 58,541) in the two datasets, while
in item-based experiments, we compute the MAE of ratings on 200 random items (I = 200) from all
the users (U = 943 or U = 129,490) in the two datasets. In addition, we only predict r̂_ij
for r_ij ≠ 0. A lower MAE denotes a higher prediction accuracy. For example, MAE = 0 means
that the predicted ratings equal the real ratings exactly, which also means that there is no
privacy guarantee against the kNN attack.
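
As an illustration of the metric, the following Python sketch computes MAE over a small rating matrix. It is a sketch under an assumption: since predictions are only made for non-zero real ratings, the function averages over those observed entries, which is the usual MAE normalisation; the function name `mean_absolute_error` is hypothetical.

```python
def mean_absolute_error(real, predicted):
    """MAE over observed ratings only (entries whose real rating is non-zero).

    `real` and `predicted` are lists of rows with identical shapes;
    a real rating of 0 means "not rated" and is skipped.
    """
    total = 0.0
    count = 0
    for real_row, predicted_row in zip(real, predicted):
        for r, p in zip(real_row, predicted_row):
            if r != 0:
                total += abs(r - p)
                count += 1
    if count == 0:
        raise ValueError("no observed ratings to evaluate")
    return total / count


real = [[5, 0, 3], [0, 4, 1]]
predicted = [[4, 2, 3], [1, 4, 3]]
print(mean_absolute_error(real, predicted))  # absolute errors 1, 0, 0, 2 over 4 ratings
```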

6.2 Experimental Results 

In this section, we show the accuracy performance, from different perspectives, of four main
neighbourhood-based CF methods, i.e., k Nearest Neighbour (kNN), naive Probabilistic
Neighbour Selection (nPNS), Private Neighbour CF (PNCF) and our method, Partitioned
Probabilistic Neighbour Selection (PPNS). Because of the similarity metric used in this paper
(cosine-based similarity), a large number of candidates in the second half of a candidate list
have zero similarity, which is useless for prediction. So in the experiments, we
set the upper bound of β as U/2k (user-based prediction) or I/2k (item-based prediction).
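
As a small worked example of this bound, the sketch below evaluates U/2k and I/2k for the dataset sizes quoted above. Rounding down to a whole number of partitions is an assumption made here for illustration, and the helper name `max_beta` is hypothetical.

```python
def max_beta(num_candidates, k):
    """Upper bound on beta: whole partitions of size k in the first half of the list."""
    return num_candidates // (2 * k)


print(max_beta(943, 50))       # user-based, MovieLens users, k = 50
print(max_beta(1682, 100))     # item-based, MovieLens films, k = 100
print(max_beta(129490, 100))   # user-based, Douban users, k = 100
```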

6.2.1 Accuracy performance without attack

We design three experiments (Figure 2 to Figure 4) to examine the user-based and item-based
CF prediction accuracy on the MovieLens dataset and the Douban film dataset. As seen in
Figure 2 to Figure 4, our privacy preserving method (PPNS) achieves much better accuracy
than the two global methods (nPNS and PNCF) on both datasets and for both
user-based and item-based CF. Moreover, as a trade-off between prediction accuracy and
system security in PPNS, a greater security metric β results in a greater MAE, which means
a worse prediction accuracy. Specifically, when β = 1, PPNS achieves the same prediction
accuracy as the kNN method, which is regarded as the baseline neighbourhood-based CF
recommendation method in this paper.



Figure 2: Item-based prediction accuracy on MovieLens (ε = 1, k = 100)


6.2.2 Accuracy performance against kNN attack

To examine the accuracy performance of the four methods against the kNN attack with the same
security guarantee, we introduce a fixed security metric β to the three privacy preserving CF
algorithms (nPNS, PNCF, PPNS). That is, in nPNS, we randomly select k neighbours from the βk
nearest candidates with weighted sampling; in PNCF, we calculate its parameter as a difference
of similarities involving sim_βk, the similarity of the βk-th nearest candidate; and in PPNS,
we select the k neighbours across the top β partitions by Algorithm 1. The
experiments are run on user-based CF because the kNN attack is a user-based attack.
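
The nPNS setting described above, weighted sampling of k neighbours from the βk nearest candidates, can be sketched as follows. This is a minimal illustration with assumptions: the text does not specify the weights, so the sketch uses weights proportional to similarity, and the function name `npns_select` is hypothetical.

```python
import random


def npns_select(sorted_candidates, k, beta, rng=None):
    """Sample k distinct neighbours from the beta*k nearest candidates.

    `sorted_candidates` is a list of (user, similarity) pairs in descending
    order of similarity. ASSUMPTION: sampling weights are proportional to
    similarity, with a uniform fallback when all remaining weights are zero.
    """
    rng = rng or random.Random()
    pool = list(sorted_candidates[:beta * k])
    chosen = []
    while pool and len(chosen) < k:
        weights = [max(sim, 0.0) for _, sim in pool]
        if sum(weights) == 0:
            weights = [1.0] * len(pool)
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index))
    return chosen
```

Unlike PPNS, this draws from the whole βk-candidate pool at once, so nothing guarantees that the selected neighbours are spread across partitions.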
Figure 3: User-based prediction accuracy on MovieLens (ε = 1, k = 100)

Figure 4: User-based prediction accuracy on Douban film (ε = 1, k = 100)

Figure 5 shows that, for the same security guarantee against the kNN attack, PPNS
performs much better on prediction accuracy than the other privacy preserving CF methods
(nPNS and PNCF). Moreover, the MAE performance of the kNN method indicates that kNN
CF does not provide any security guarantee against the kNN attack. Additionally, as we regard
β as the security metric, we observe a trade-off between accuracy and security,
because a greater β yields a greater MAE, which denotes less prediction accuracy.
Figure 5: Prediction accuracy on MovieLens against kNN attack (ε = 1, k = 50, m = 8)


Figure 6 demonstrates the impact of the recommendation parameter k on the prediction
accuracy. We examine values of k from 10 to 100, which is a popular range for the
recommendation parameter k. From Figure 6, we can see that a larger neighbour set (which is
also the partition size in PPNS) gives better prediction accuracy for the PPNS method against
the kNN attack.



Figure 6: Impacts of k on prediction accuracy against kNN attack on MovieLens (ε = 1, m = 8, β = 7)


Figure 7 illustrates the impact of the differential privacy budget ε on the prediction
accuracy. As ε increases, the MAE performance improves in the two differential privacy
methods (PNCF and PPNS). So, to achieve a better prediction accuracy against kNN attacks, it
is suggested to set a greater ε.
Figure 7: Impacts of ε on prediction accuracy against kNN attack on MovieLens (k = 50, m = 8, β = 7)


Figure 8 presents the impact of the attacking parameter m on the prediction accuracy. We can
note that to reveal a target customer's privacy by the kNN attack, the attacker needs at least 8
real ratings of the target customer as auxiliary information, since when m ≥ 8, the MAE of
the non-privacy preserving CF method (kNN) is zero. When the attacker has more background
knowledge, the prediction will be closer to the real ratings for all of the neighbourhood-based
CF systems, but none of the privacy preserving algorithms discloses the customer's private
information.



Figure 8: Impacts of m on prediction accuracy against kNN attack on MovieLens (ε = 1, k = 50, β = 7)
7 Conclusion


Recommender systems have played an important role in Internet commerce since the first decade
of the 21st century. To protect customers' private information against the kNN attack during
the process of filtering, the existing privacy preserving neighbourhood-based CF recommendation
methods (e.g., [27]) introduced global noise into the covariance matrix and the process of
neighbour selection. However, they neither ensure the prediction accuracy, because of the global
noise, nor guarantee assured security enforcement against the kNN attack before the
collaborative filtering. To overcome the weaknesses of the current probabilistic methods, we
propose a novel privacy preserving neighbourhood-based CF method, Partitioned Probabilistic
Neighbour Selection, which ensures a required level of security while achieving the optimal
prediction accuracy against the kNN attack. The theoretical and experimental analysis shows
that, for the same security guarantee against the kNN attack, our method ensures the optimal
recommendation accuracy among the current randomised neighbourhood-based CF recommendation
methods.